Business task

1) Data:

Information about Australian Rainfall.

2) Business objective:

Predict whether there will be at least 1 mm of rain on the following day.

3) No information is available about the cost of heavy rain.


Explore data

Target balance

Summary about target: the target is slightly imbalanced.
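A minimal sketch of how the target balance could be checked. The column name `RainTomorrow`, the "Yes"/"No" encoding and the toy values are assumptions made for illustration; they do not reproduce the real class proportions.

```python
import pandas as pd

# Toy target, invented for illustration ("Yes" = at least 1 mm of rain tomorrow).
target = pd.Series(
    ["No", "No", "No", "Yes", "No", "Yes", "No", "No"], name="RainTomorrow"
)

# Share of each class; a strong skew here would argue against plain accuracy.
class_share = target.value_counts(normalize=True)
print(class_share)
```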


Modeling approach

1) In ML terms, this is a binary classification task:

class#1 – it will rain tomorrow (at least 1 mm)  
class#0 – it won't rain tomorrow (less than 1 mm).

2) Modeling objective: F1-score.

F1-score is a reasonable trade-off between precision and recall: there is no additional information about the costs of FP/FN cases, and the target is slightly imbalanced.
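For reference, F1 is the harmonic mean of precision and recall for the positive class. The sketch below computes it from scratch on toy labels; the helper name `f1_from_labels` and the labels are invented for illustration.

```python
def f1_from_labels(y_true, y_pred):
    """Return (precision, recall, f1) for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # rain predicted, rain observed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # rain predicted, no rain
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # rain missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels: 1 = rain tomorrow, 0 = no rain tomorrow.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]
precision, recall, f1 = f1_from_labels(y_true, y_pred)
```

Because F1 weights false positives and false negatives symmetrically, it is a natural default when neither error type has a known cost.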

Change formats and analyse missing values

Add calendar features
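One possible way to derive calendar features with pandas. The column name `Date`, the helper name `add_calendar_features` and the chosen set of features are assumptions; the notebook may use different ones.

```python
import pandas as pd


def add_calendar_features(df, date_col="Date"):
    """Return a copy of df with year, month and day-of-year columns added."""
    out = df.copy()
    out[date_col] = pd.to_datetime(out[date_col])
    out["year"] = out[date_col].dt.year
    out["month"] = out[date_col].dt.month
    out["day_of_year"] = out[date_col].dt.dayofyear
    return out


# Toy rows, invented for illustration.
sample = pd.DataFrame({"Date": ["2016-12-31", "2017-01-01"], "Rainfall": [0.0, 2.4]})
featured = add_calendar_features(sample)
```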

Plot feature "Rainfall" at a high level (monthly aggregation)
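A sketch of the monthly aggregation, assuming columns named `Date` and `Rainfall` and a monthly total as the aggregate (a monthly mean would work the same way). The toy values are invented.

```python
import pandas as pd

# Toy daily observations, invented for illustration.
df = pd.DataFrame(
    {
        "Date": ["2016-01-05", "2016-01-20", "2016-02-10"],
        "Rainfall": [1.0, 3.0, 2.0],
    }
)

# Group daily rows by calendar month and sum the rainfall.
dates = pd.to_datetime(df["Date"])
monthly = df.groupby(dates.dt.to_period("M"))["Rainfall"].sum()
print(monthly)
```

Calling `monthly.plot()` (which needs matplotlib) would then draw the high-level seasonal curve.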

Summary about Rainfall:

There were quite strong seasonal patterns in Rainfall in earlier years (2009-2012), with the high season in December-January. Since 2013, however, most peaks have shifted to Apr-Jun. One idea for the future is therefore not to use the full history of observations, for example by dropping the years before 2014.

Train/Test split:

1) The last 12 months form the test set, so that the model's performance is tested across different seasons.
2) The previous 3 years form the train set, which reflects the latest trends.
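The split described above could be sketched as follows. The helper name `split_by_time`, the column name `Date` and the exact boundary handling (cut-offs measured back from the last observed date) are assumptions.

```python
import pandas as pd


def split_by_time(df, date_col="Date", test_months=12, train_years=3):
    """Last `test_months` months go to test, the `train_years` before that to train."""
    dates = pd.to_datetime(df[date_col])
    test_start = dates.max() - pd.DateOffset(months=test_months)
    train_start = test_start - pd.DateOffset(years=train_years)
    train = df[(dates > train_start) & (dates <= test_start)]
    test = df[dates > test_start]
    return train, test


# Toy daily calendar, invented for illustration.
df = pd.DataFrame({"Date": pd.date_range("2010-01-01", "2017-06-25", freq="D")})
train, test = split_by_time(df)
```

Rows older than the train window are simply dropped, which matches the idea of keeping only the most recent trends.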

Block1 numerical data

Block2 categorical data

Summary about data:

1) Pay close attention to Humidity and RainToday: they are the most correlated with the target. 
   Also, an idea for the future: exclude redundant variables that are strongly correlated with each other.
2) There is a difference between locations. Idea for the future: separate models for the most and least “rainy” groups of locations.

Modeling

Handle missing values and transform variables

1) KNNImputer for numerical
2) SimpleImputer for categorical
3) RobustScaler to mitigate the influence of potential outliers
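The three steps above could be combined in a scikit-learn `ColumnTransformer`, sketched below. The column names, `n_neighbors=2`, the `most_frequent` strategy and the added `OneHotEncoder` step are assumptions made so that the example runs end to end; the notebook's actual settings may differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

numeric_cols = ["Humidity", "Pressure"]  # hypothetical column names
categorical_cols = ["RainToday"]         # hypothetical column name

# Numerical block: fill gaps from nearest neighbours, then scale by median/IQR.
numeric_pipe = Pipeline(
    [("impute", KNNImputer(n_neighbors=2)), ("scale", RobustScaler())]
)
# Categorical block: fill gaps with the mode, then one-hot encode (assumed step).
categorical_pipe = Pipeline(
    [
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]
)
preprocessor = ColumnTransformer(
    [("num", numeric_pipe, numeric_cols), ("cat", categorical_pipe, categorical_cols)],
    sparse_threshold=0.0,  # always return a dense array
)

# Toy rows with missing values, invented for illustration.
toy = pd.DataFrame(
    {
        "Humidity": [40.0, np.nan, 85.0, 60.0],
        "Pressure": [1015.0, 1008.0, np.nan, 1012.0],
        "RainToday": pd.Series(["No", "Yes", np.nan, "No"], dtype=object),
    }
)
transformed = preprocessor.fit_transform(toy)
```

Fitting the preprocessor only on the train set, and reusing it on validation and test data, keeps imputation and scaling statistics from leaking across the time split.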

Baseline

Choose the best model by tuning hyperparameters with TimeSeriesSplit validation on the train set

Evaluate the best model on test

Evaluate the naive model (RainToday) on test
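The naive model can be read as "tomorrow will be like today": it predicts rain tomorrow exactly when it rained today. A minimal sketch, assuming columns `RainToday` and `RainTomorrow` encoded as "Yes"/"No"; the toy rows are invented.

```python
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy test set, invented for illustration.
test = pd.DataFrame(
    {
        "RainToday": ["No", "Yes", "Yes", "No", "No", "Yes"],
        "RainTomorrow": ["Yes", "Yes", "No", "No", "No", "Yes"],
    }
)

y_true = (test["RainTomorrow"] == "Yes").astype(int)
y_naive = (test["RainToday"] == "Yes").astype(int)  # naive prediction

naive_precision = precision_score(y_true, y_naive)
naive_recall = recall_score(y_true, y_naive)
naive_f1 = f1_score(y_true, y_naive)
```

Computing the same metrics for the tuned model on the same rows gives a like-for-like comparison against this baseline.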

Additional analysis of model

Summary about model:

1) LGBMClassifier shows the best results on the test set.
2) There is a significant improvement over the naive model: LGBMClassifier's precision is 50% better, but quite a few rain cases are still missed by the model.
3) Humidity, Pressure and Wind Speed are the most important features.

Next ideas:

1) Data processing: use knowledge about the temporal and geospatial distribution of the data.
2) Feature engineering: lagged and rolling-window features, geospatial features, weather predictions from other models.
3) Modeling approaches: DL models (Deep GPVAR, STCN), separate models for the most and least “rainy” groups of locations.
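As an illustration of the lagged and rolling-window idea, the sketch below builds both per location. The helper name `add_lag_features`, the column names `Location`, `Date` and `Rainfall`, and the lag and window sizes are assumptions.

```python
import pandas as pd


def add_lag_features(df, group_col="Location", date_col="Date",
                     value_col="Rainfall", lag=1, window=3):
    """Add a lagged value and a rolling mean of past values, per group."""
    out = df.sort_values([group_col, date_col]).copy()
    grouped = out.groupby(group_col)[value_col]
    out[f"{value_col}_lag{lag}"] = grouped.shift(lag)
    # Shift by one day before rolling so the window only sees past observations.
    out[f"{value_col}_roll{window}"] = grouped.transform(
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )
    return out


# Toy observations for two locations, invented for illustration.
df = pd.DataFrame(
    {
        "Location": ["A", "A", "A", "A", "B", "B"],
        "Date": pd.to_datetime(
            ["2017-01-01", "2017-01-02", "2017-01-03", "2017-01-04",
             "2017-01-01", "2017-01-02"]
        ),
        "Rainfall": [0.0, 2.0, 4.0, 6.0, 1.0, 3.0],
    }
)
lagged = add_lag_features(df)
```

The first row of each location has no history, so both new columns are missing there and would go through the same imputation step as the other numerical features.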